Retain underlying error on proxy mode connection failure #94998
elasticsearchmachine merged 20 commits into elastic:main
Conversation
When a remote cluster node fails to connect in sniff mode, the underlying error is preserved and reported. But in proxy mode, the underlying error is swallowed and it reports only a generic "Unable to open any proxy connections" error. This PR improves the error reporting in proxy mode to align with that of sniff mode. This allows consistent error messages in the API response and clearer diagnostics for underlying issues, e.g. authentication failure, incompatible license, etc.
Hi @ywangd, I've created a changelog YAML for you.
Pinging @elastic/es-security (Team:Security)
Pinging @elastic/es-distributed (Team:Distributed)
openConnections(finished, attemptNumber + 1);
if (attemptNumber >= MAX_CONNECT_ATTEMPTS_PER_RUN && connectionManager.size() == 0) {
    logger.warn(() -> "failed to open any proxy connections to cluster [" + clusterAlias + "]", e);
    finished.onFailure(new NoSeedNodeLeftException(strategyType(), clusterAlias, e));
} else {
    openConnections(finished, attemptNumber + 1);
This change basically tries to mirror how errors are handled in sniff mode, so that the underlying cause of the failure is retained and reported.
With this change, do we still need to check connectionManager.size() in the else branch below? I.e. can we remove this cause-free exception?
finished.onFailure(new NoSeedNodeLeftException(strategyType(), clusterAlias));
Or maybe we should have the attemptNumber >= MAX_CONNECT_ATTEMPTS_PER_RUN check in the onResponse handler above too.
With this change, do we still need to check connectionManager.size() in the else branch below?
In practice, we should not need it. But in theory, it is possible that the if (attemptNumber <= MAX_CONNECT_ATTEMPTS_PER_RUN) check fails in the first invocation and goes to the else branch.
We have a similar else branch in SniffConnectionStrategy as well. Therefore I didn't remove it.
Right, yes, but since this only affects the first invocation (and only if there are no seed nodes or some other weird config?) I think it'd be better to handle it up-front. That would clarify that once we get here we never return a cause-free NoSeedNodeLeftException.
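The up-front handling suggested here could look roughly like the following. This is a minimal, self-contained sketch with invented names (RetrySketch, openConnections taking a Supplier), not the actual ProxyConnectionStrategy code: the attempt-exhaustion check lives in one place, so the failure path always carries the underlying cause and a cause-free exception can never be thrown.

```java
import java.util.function.Supplier;

// Hypothetical sketch (not the real ProxyConnectionStrategy): retry opening
// connections, always reporting the last underlying failure as the cause.
public class RetrySketch {
    static final int MAX_ATTEMPTS = 3;

    // connector.get() returns true when a connection was opened, or throws.
    public static void openConnections(Supplier<Boolean> connector) throws Exception {
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                if (connector.get()) {
                    return; // opened at least one connection
                }
            } catch (RuntimeException e) {
                lastFailure = e;
            }
        }
        // attempts exhausted: surface the underlying failure as the cause
        throw new Exception("unable to open any proxy connections", lastFailure);
    }
}
```

With this shape there is no per-iteration else branch that could throw without a cause, which is the property the review comment asks for.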
}
logger.warn(() -> "fetching nodes from external cluster [" + clusterAlias + "] failed", e);
listener.onFailure(e);
listener.onFailure(new NoSeedNodeLeftException(strategyType(), clusterAlias, e));
For consistency between the two modes, I am also updating this to be wrapped within a NoSeedNodeLeftException, which makes the error message more explanatory in the REST response.
I think this only makes sense if seedNodesSuppliers.hasNext() == false. If we encountered some other more fundamental error then we didn't run out of seed nodes here.
Also IMO it would be good to eliminate the seedNodesSuppliers.hasNext() == false case entirely in this method so that there's always a cause to report.
I think this only makes sense if seedNodesSuppliers.hasNext() == false
Good catch! Will fix.
Also IMO it would be good to eliminate the seedNodesSuppliers.hasNext() == false case entirely in this method so that there's always a cause to report.
Sorry but I don't quite follow. There is no existing seedNodesSuppliers.hasNext() == false case. Are you referring to the else branch of the opening if (seedNodesSuppliers.hasNext()), i.e. related to the above comment? I think in practice the else branch never runs, so the cause is always reported. Please let me know if you'd like to drop the else branch for both the Sniff and Proxy strategies. I can make the change to both; I just want them to be consistent.
Yes I meant the else branch. I wonder if it could run if seedNodesSuppliers is empty. Not sure if this can happen but maybe it could in future even if not today? I'd like to protect against that at least, but again I think it's better to do so up-front rather than on every iteration.
expectThrows(Exception.class, connectFuture::actionGet);
final NoSeedNodeLeftException exception = (NoSeedNodeLeftException) expectThrows(
    ExecutionException.class,
    connectFuture::get
).getCause();
Sniff mode was throwing a ConnectTransportException directly. As commented above, I changed it to be wrapped inside a NoSeedNodeLeftException.
DaveCTurner
left a comment
Makes sense to me. I left a couple of inline thoughts. Moreover:
- I've seen cases where the last error wasn't the useful one - could we do more to report the other errors encountered along the way (maybe collecting them as suppressed exceptions)?
- If the inner exception is a ConnectTransportException then it will report the address to which it was trying to connect, but other exceptions will not include this and I think that makes further investigation much harder. Could we make sure that we include the target address always?
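The two suggestions above could be combined roughly as follows. This is a hedged sketch with invented names (FailureReportSketch, buildFailure, the message format), not the actual elasticsearch code: earlier per-attempt failures are attached as suppressed exceptions, and the target address always goes into the top-level message so it is reported even when the cause is not a ConnectTransportException.

```java
import java.util.List;

// Hypothetical sketch: build a single failure that carries the last error
// as the cause, the earlier errors as suppressed exceptions, and the target
// address in the top-level message.
public class FailureReportSketch {
    public static Exception buildFailure(String clusterAlias, String proxyAddress, List<Exception> attemptFailures) {
        Exception last = attemptFailures.isEmpty() ? null : attemptFailures.get(attemptFailures.size() - 1);
        Exception top = new Exception(
            "unable to open any proxy connections to cluster [" + clusterAlias + "] at [" + proxyAddress + "]",
            last
        );
        // keep the earlier failures too, so the last error is not the only one reported
        for (int i = 0; i < attemptFailures.size() - 1; i++) {
            top.addSuppressed(attemptFailures.get(i));
        }
        return top;
    }
}
```

Java's Throwable.addSuppressed makes the extra errors show up automatically in stack traces, so no custom rendering is needed.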
server/src/main/java/org/elasticsearch/transport/NoSeedNodeLeftException.java
@DaveCTurner I pushed an update that should address your comments. Could you please take another look to see whether the changes match your suggestion? Thanks!
.put("cluster.remote.invalid_remote.mode", "proxy")
.put("cluster.remote.invalid_remote.proxy_address", fulfillingCluster.getRemoteClusterServerEndpoint(0))
.build()
randomBoolean() && false
The && false is maybe a leftover from a debugging session?
...g/elasticsearch/xpack/remotecluster/RemoteClusterSecurityLicensingAndFeatureUsageRestIT.java
if (seedNodesSuppliers.hasNext()) {
    listener.onFailure(getNoSeedNodeLeftException(Set.of(e)));
} else {
    listener.onFailure(e);
Shouldn't this be flipped? I.e., if there are no seed nodes left, we want a NoSeedNodeLeftException or no?
Went back to the original thread and I see the intention now: if we hit an exception that we think we can't recover from by moving on to a different seed node, we don't want to wrap in NoSeedNodeLeftException here.
Given that, the behavior is still confusing though: suppose we have two seed nodes, and we fail with a ConnectTransportException on the first one, then with a "terminating" exception on the second node. In that case, we don't wrap the exception. However, if we had three seed nodes, and failed on the second with a "terminating" exception, we would wrap.
Sorry if I'm missing something here! If I am, it would be great to include a comment in the code with a little more context.
As a side note: not wrapping on "terminating" exceptions is what seems to produce the inconsistency in status codes between proxy mode and sniff mode (e.g., for the license failures, we get a 500 in proxy mode, but a 403 in sniff mode). I'm wondering if we should introduce a second class of failure that still produces a 500 for the case where we don't want to return a NoSeedNodeLeftException.
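The status-code concern can be illustrated with a small sketch. All names here are invented (HasStatus, LicenseFailure, NoConnectionsWrapper), not the real elasticsearch exception hierarchy, which resolves statuses its own way: a wrapper that reports a fixed 500 hides the status of its cause, while walking the cause chain keeps e.g. a 403 license failure visible through the wrapper.

```java
// Hypothetical sketch: resolve an HTTP-like status by walking the cause
// chain, falling back to 500 only when nothing in the chain carries one.
public class StatusSketch {
    interface HasStatus {
        int status();
    }

    static class LicenseFailure extends RuntimeException implements HasStatus {
        public int status() {
            return 403; // forbidden
        }
    }

    static class NoConnectionsWrapper extends RuntimeException {
        NoConnectionsWrapper(Throwable cause) {
            super(cause);
        }
    }

    public static int statusOf(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof HasStatus withStatus) {
                return withStatus.status();
            }
        }
        return 500; // last resort: internal server error
    }
}
```

Under this scheme, wrapping would no longer change the reported status, which would remove the 500-vs-403 inconsistency between the two modes.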
Just pushed an update to revert some changes to SniffConnectionStrategy. I think it's better to focus on proxy mode in this PR, as suggested in the title.
@DaveCTurner @n1v0lg Friendly reminder for review at your convenience. As commented above, I have decided to take a minimal approach with this PR to focus on its primary goal. Happy to work on follow-ups once this one is through.
n1v0lg
left a comment
LGTM. Sorry for the delay!
From my end, I'm good with only updating proxy mode, and leaving out the sniff mode improvements. Likewise, good with not further refactoring proxy connection strategy in this PR.
    }
}

private NoSeedNodeLeftException getNoSeedNodeLeftException(Collection<Exception> suppressedExceptions) {
Since we're only ever passing an empty set here, we could just remove the suppressedExceptions arg altogether.
Since I have reduced the scope of this PR, this method is no longer necessary. I inlined it, which also removes the need for suppressedExceptions. Thanks for spotting it.
if (countDown.countDown()) {
    openConnections(finished, attemptNumber + 1);
    if (attemptNumber >= MAX_CONNECT_ATTEMPTS_PER_RUN && connectionManager.size() == 0) {
        logger.warn(() -> "failed to open any proxy connections to cluster [" + clusterAlias + "]", e);
Sanity checking that warn is the right log level: this could be caused by longer-lasting but still temporary network glitches, so I'm wondering if we want info or even debug here instead.
warn is good. info shouldn't really be used to report exceptions, and debug is hidden by default so will make troubleshooting too difficult. See these docs for more info.
cheers for the pointer, all good to keep this at warn here!
@elasticmachine update branch

@DaveCTurner Do you still plan to review the PR? I'd like to get it merged before 8.8 FF. Thanks!

@elasticmachine update branch

I'll be merging this PR. Happy to work on follow-ups as discussed.

Failure is #95514

@elasticmachine update branch

@elasticmachine update branch

@elasticmachine run elasticsearch-ci/docs-check
This PR removes a bogus assertion from ProxyConnectionStrategy. The statement tries to assert that all connection exceptions are caught in the if branch and that we should have at least one connection when the code reaches the else branch. This is simply not true, because checking whether we should open more connections and subsequently opening connections are *not* atomic. The number of connections can change in between, e.g. the remote end drops the connection, which makes the code flow go to the else branch and trips the bogus assertion. Relates: elastic#94998 Resolves: elastic#99113
Today, when the number of attempts is exhausted, ProxyConnectionStrategy checks the number of connections before returning. It reports connection failure if the number of connections is zero at the time of checking. However, this behaviour is incorrect. In rare cases, a connection can be dropped right after it is initially established and before the count is checked. From the perspective of the `openConnections` method, it should not care whether or when opened connections are subsequently closed. As long as connections have been initially established, it should report success instead of failure. This PR adjusts the code to report success in the above situation. Relates: #94998 Resolves: #99113
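The race described above can be illustrated with a minimal sketch (invented names, not the actual ProxyConnectionStrategy): recording "ever opened" makes the success/failure decision independent of whether the remote end drops the connection before the final check runs.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: track whether any connection was ever established
// during this run, instead of re-checking the live connection count, which
// can drop to zero concurrently.
public class OpenRaceSketch {
    private final AtomicInteger liveConnections = new AtomicInteger();
    private final AtomicBoolean everOpened = new AtomicBoolean();

    public void onConnectionOpened() {
        liveConnections.incrementAndGet();
        everOpened.set(true);
    }

    public void onConnectionClosed() {
        liveConnections.decrementAndGet();
    }

    // Old behavior: reports failure if the remote dropped the connection
    // before the check ran, even though a connection attempt succeeded.
    public boolean successByLiveCount() {
        return liveConnections.get() > 0;
    }

    // Fixed behavior: success once a connection was ever established.
    public boolean successByEverOpened() {
        return everOpened.get();
    }
}
```

The two predicates only disagree in the narrow window where a connection is opened and then dropped before the count check, which is exactly the case the assertion tripped on.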